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1. Introduction 

Substantial progress has been achieved over the last years in estimating high- 
dimensional regression models. A thorough introduction to this dynamic field of 
contemporary statistics is provided by the recent monographs Hastie, Tibshirani 
and Friedman [23], Biihlmann and van de Geer [12]. In the popular framework 
of linear and generalized linear models, the Lasso estimator introduced by Tib- 
shirani [42] immediately proved successful. Its theoretical properties have been 
extensively studied and its popularity has never wavered since then, see for 
example Bunea, Tsybakov and Wegkamp [11], van de Geer [44], Bickel, Ritov 
and Tsybakov [10], Meinshausen and Yu [30]. However, even though numer- 
ous phenomena are well captured within this linear context, restraining high- 
dimensional statistics to this setting is unsatisfactory. To relax the strong as- 
sumptions required in the linear framework, one idea is to investigate a more 
general class of models, such as nonparametric regression models of the form 
Y = f(X) + W, where Y denotes the response, X the predictor and W a zero- 
mean noise. A good compromise between complexity and effectiveness is the 
additive model. It has been extensively studied and formalized for thirty years 
now. Amongst many other references, the reader is invited to refer to Stone 
[39], Hastie and Tibshirani [21, 22], Hardle [24]. The core of this model is that 
the regression function is written as a sum of univariate functions / = 53f=i fii 
easing its interpretation. Indeed, each covariate's effect is assessed by a unique 
function. This class of nonparametric models is a popular setting in statistics, 
despite the fact that classical estimation procedures are known to perform poorly 
as soon as the number of covariates p exceeds the number of observations n in 
that setting. 

In the present paper, our goal is to investigate a PAC-Bayesian-based pre- 
diction strategy in the high-dimensional additive framework (p»n paradigm) . 
In that context, estimation is essentially possible at the price of a sparsity as- 
sumption, i.e., most of the /j functions are zero. More precisely, our setting is 
non-asymptotic. As empirical evidences of sparse representations accumulate, 
high-dimensional statistics are more and more coupled with a sparsity assump- 
tion, namely that the intrinsic dimension p of the data is much smaller than 
p and n, see e.g. Giraud, Huet and Verzelen [18]. Additive modelling under a 
sparsity constraint has been essentially studied under the scope of the Lasso in 
Meier, van de Geer and Biihlmann [29], Suzuki and Sugiyama [41] and Koltchin- 
skii and Yuan [25] or of a combination of functional grouped Lasso and backfit- 
ting algorithm in Ravikumar et al. [35]. Those papers inaugurated the study of 
this problem and contain essential theoretical results consisting in asymptotics 
(see Meier, van de Geer and Biihlmann [29], Ravikumar et al. [35]) and non- 
asymptotics (see Suzuki and Sugiyama [41], Koltchinskii and Yuan [25]) oracle 
inequalities. The present article should be seen as a constructive contribution to- 
wards a deeper understanding of prediction problems in the additive framework. 
It should also be stressed that our work is to be seen as an attempt to relax as 
much as possible assumptions made on the model, such as restrictive conditions 
on the regressors' matrix. We consider them too much of a non-realistic burden 
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when it comes to prediction problems. 

Our modus operandi will be based on PAC-Bayesian results, which is original 
in that context to our knowledge. The PAC-Bayesian theory originates in the 
two seminal papers Shawe- Taylor and Williamson [38] , McAllester [28] and has 
been extensively formalized in the context of classification (see Catoni [14, 15]) 
and regression (see Audibert [5, 6], Alquier [1, 2], Audibert and Catoni [8, 9[). 
However, the methods presented in these references are not explicitly designed 
to cover the high-dimensional setting under the sparsity assumption. Thus, the 
PAC-Bayesian theory has been worked out in the sparsity perspective lately, by 
Dalalyan and Tsybakov [16, 17], Alquier and Lounici [4], Rigollet and Tsybakov 
[37]. The main message of these studies is that aggregation with a properly cho- 
sen prior is able to deal effectively with the sparsity issue. Interesting additional 
references addressing the aggregation outcomes would be Rigollet [36] , Audibert 
[7[. The former aggregation procedures rely on an exponential weights approach, 
achieving good statistical properties. Our method should be seen as an extension 
of these techniques, and is particularly focused on additive modelling specifici- 
ties. Contrary to procedures such as the Lasso, the Dantzig selector and other 
penalized methods which are provably consistent under restrictive assumptions 
on the Gram matrix associated to the predictors, PAC-Bayesian aggregation 
requires only minimal assumptions on the model. Our method is supported by 
oracle inequalities in probability, that are valid in both asymptotic and non- 
asymptotic settings. We also show that our estimators achieve the optimal rate 
of convergence over traditional smoothing classes such as Sobolev ellipsoids. It 
should be stressed that our work is inspired by Alquier and Biau [3], which ad- 
dresses the celebrated single-index model with similar tools and philosophy. Let 
us also mention that although the use of PAC-Bayesian techniques are original 
in this context, parallel work has been conducted in the deterministic design 
case by [40]. 

A major difficulty when considering high-dimensional problems is to achieve a 
favorable compromise between statistical and computational performances. The 
recent and thorough monograph Buhlmann and van de Geer [12] shall provide 
the reader with valuable insights that address this drawback. As a consequence, 
the explicit implementation of PAC-Bayesian techniques remains unsatisfac- 
tory as existing routines are only put to test with small values of p (typically 
p < 100), contradicting with the high-dimensional framework. In the mean- 
time, as a solution of a convex problem the Lasso proves computable for large 
values of p in reasonable amounts of time. We therefore focused on improving 
the computational aspect of our PAC-Bayesian strategy. Monte Carlo Markov 
Chains (MCMC) techniques proved increasingly popular in the Bayesian com- 
munity, for they probably are the best way of sampling from potentially complex 
probability distributions. The reader willing to find a thorough introduction to 
such techniques is invited to refer to the comprehensive monographs Marin 
and Robert [26], Meyn and Tweedie [31]. While Alquier and Biau [3], Alquier 
and Lounici [4] explore versions of the reversible jump MCMC method (RJM- 
CMC) introduced by Green [19], Dalalyan and Tsybakov [16, 17] investigate 
a Langevin-Monte Carlo-based method, however only a deterministic design is 
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considered. We shall try to overcome those limitations by considering adapta- 
tions of a recent procedure whose comprehensive description is to be found in 
Petralias [32], Petralias and Dellaportas [33]. This procedure called Subspace 
Carlin and Chib algorithm originates in the seminal paper Carlin and Chib [13], 
and has a close philosophy of Hans, Dobra and West [20] , as it favors local moves 
for the Markov chain. We shall provide numerical evidence that our method is 
computationally efficient, on simulated data. 

The paper is organized as follows. Section 2 presents our PAC-Bayesian pre- 
diction strategy in additive models. In particular, it contains the main theoreti- 
cal results of this paper which consist in oracle inequalities. Section 3 is devoted 
to the implementation of our procedure, along with numerical experiments on 
simulated data, presented in Section 4. Finally, and for the sake of clarity, proofs 
have been postponed to Section 5. 

2. PAC-Bayesian prediction 

Let (p,,A,¥) be a probability space on which we denote by {(X-i,Yi)}2_ 1 a 
sample of n independent and identically distributed (i.i.d.) random vectors in 
(—1, l) p x R, with Xi = (Xn, . . . , X ip ), satisfying 

v 

Y t = i>*(Xi) + 6 = + &> » e {l, . . . ,p}, 

where ip*, ... are p continuous functions (—1, 1) — > R and {£i}"=i is a set of 
i.i.d. (conditionaly to {(Xj, real-valued random variables. Let V denote 

the distribution of the sample {(Xj, Yi)}f =1 . Denote by E the expectation com- 
puted with respect to P and let || • be the supremum norm. We make the 
two following assumptions. 

(Al) For any integer k, E[|£i| fc ] < oo, E[£i| Xi] = and there exist two positive 
constants L and a 1 such that for any integer k > 2, 

Ell&flXi] < ^W- 2 . 

(A2) There exists a constant C > max(l, a) such that ||?A*||oo < C. 

Note that Al implies that E^ = and that the distribution of £i may depend 
on Xi. In particular, Al holds if £i is a zero-mean gaussian with variance 7 2 (X!) 
where x i— >• j 2 (x) is bounded. 

We are mostly interested in sparse additive models, in which only a few 
{ip*}*j =1 are not identically zero. Let {ifik}kLi be a known countable set of 
continuous functions R — > (—1, 1) called the dictionary. In the sequel, |J£| stands 
for the cardinality of a set !K. For any p-th tuple m = (mi, . . . , m p ) 6 N p , denote 
by 5(m) C {1, . . . ,p} the set of indices of nonzero elements of m, i.e., 

p 

|S(m)|=]TlK->0], 

3=1 
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and define 

e m = {6> g R mi x •• • x R m "}, 

with the convention K = 0. The set G m is embedded with its canonical Borel 
field S(0 m ) = 'B(R mi ) ® • • • ® !B(M mp ). Denote by 

e d ^ f (J e m , 

which is equipped with the cr-algebra 7 — a (V m eM ^(®m)), where M is the 
collection of models M = {m = (mi, . . . , m p ) g N p }. Consider the span of the 
set {vfe}feLi) *- e -, the set of functions 

^0 = Z Z Z^ fc:6leem ' mG:M f > 

jeS(m) j£S(m)fc=l J 

equipped with a countable generated cr-algebra denoted by 3\ The risk and 
empirical risk associated to any ip$ g F are defined respectively as 

i?(^)=E[Y 1 -^(X 1 )] 2 and J?„(^) = r n ({Xj, Y-}? =1 , i>e), 

where 

1 ™ 

r„({xi,2/i}^ =1 ,Ve) = - V (i/i - iMxi)) 2 • 
n 

Consider the probability r/ Q on the set M defined by 

for some a G (0, 1/2). Let us stress the fact that the probability r/ a acts as a 
penalization term over a model m, on the number of its active regressors through 
the combinatorial term (ig^i) and on their expansion through a^- 1 mj . 

Our procedure relies on the following construction of the probability tt, re- 
ferred to as the prior, in order to promote the sparsity properties of the target 
regression function ip*. For any meM,(>0 and x € m , denote by H x m (x, Q 
the ^ 1 -ball centered in x with radius £. For any m g M, denote by 7r m the 
uniform distribution on 23 m (0,(7). Define the probability ir on (0,T), 

n{A) = Va(m)Tr m (A), Ae7. 
Note that the volume V m (C) of 3^,(0, C) is given by 

V m (C) = -. r- = -. r-. 

r ( Ejes(m) m 3 + 1 ) ( Ejes(m) TO J ■ ) ! 
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Finally, set S > (which may be interpreted as an inverse temperature param- 
eter) and the posterior Gibbs transition density is 



M{(^)}"=i,0) 

V m (C) ^ / exp[-£r n ({x^ yi }? =1 , 1> e )]ir(M) 



We then consider two competing estimators. The first one is the randomized 
Gibbs estimator constructed with parameters 9 sampled from the posterior 
Gibbs density i.e., for any A E 3\ 

P(* G A|{X i5 ^}f =1 ) = / psUX^Y^^Mde), (2.2) 

J A 

while the second one is the aggregated Gibbs estimator fy A defined as the pos- 
terior mean 

* A = [ rPe P 6({X i ,Y i }Y =1 ,e)n(de)=E[*\{X i ,Y i }? =1 l (2-3) 

These estimators have been introduced in Catoni [14, 15] and investigated in 
further work by Audibert [5], Alquier [1, 2], Dalalyan and Tsybakov [16, 17]. 

For the sake of clarity, denote by CD a generic numerical constant in the sequel. 
We are now in a position to write a PAC-Bayesian oracle inequality. 

Theorem 2.1. Let ij) and ?/> A be realizations of the Gibbs estimators defined 
by (2.2)- (2.3), respectively. Let Al and A2 hold. Set w = 8Cmax(L, C) and 
S = n£/[w + 4:(a 2 + C 2 )], for £ G (0,1), and let e G (0,1). Then with P -probability 
at least 1 — 2e, 

*$L~ R( fl \ < d inf inf - W) 

i?(</> A ) - i?(V> ) )- meM 0eS ^( O ,c) 1 V ; V y 

+ |5 (m) | l0 ^ /|S(m)l) + ^M E mi + «Ml (2.4) 

jGS(m) J 

where T> depends upon w, a , C, £ and a defined above. 

Under mild assumptions, Theorem 2.1 provides inequalities which admit the 
following interpretation. If there exists a "small" model in the collection M, 
i.e., a model m such that X^es(m) m i anc ^ l^( m )l are small, such that ipg (with 
9 G ©m) is close to ip*, then if; and ■0 A a re also close to -0* up to log(n)/n 
and log(p)/n terms. However, if no such model exists, at least one of the terms 
SjeS(m) m j/ n an d ^(m)!/?!, starts to emerge, thereby deteriorating the global 
quality of the bound. A satisfying estimation of ip* is typically possible when 
ip* admits a sparse representation. 

To go further, we derive from Theorem 2.1 an inequality on Sobolev ellipsoids. 
We show that our procedure achieves the optimal rate of convergence in this 
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setting. For the sake of shortness, we consider Sobolev spaces, however one can 
easily derive the following results in other functional spaces such as Besov spaces. 
See Tsybakov [43] and the references therein. 

The notation {<fk}kLi now refers to the (non-normalized) trigonometric sys- 
tem, defined as 

ipi : 1 1-> 1, ifi2j ■ t H> cos(ir jt), y>2j+\ '■ t H> sm(irjt), 

with j £ N* and t £ (—1,1). Let us denote by S* the set of indices of non- 
idcntically zero regressors. That is, the regression function ip* is 

r = E fl- 
ies* 

Assume that for any j £ S*, ip* belongs to the Sobolev ellipsoid W(rj,dj) 
defined as 

{oo oo 

/6L 2 ([-l,l]):/ = ^te and ^ i 2 ' 0; • d 3 

k=l i=l 

for unknown regularity parameters ri,...,r\s*\ > 2. Let us stress the fact 
that this assumption casts our results onto the adaptive setting. It also im- 
plies that ip* belongs to the Sobolev ellipsoid W(r, d), with r = mixij^s* r j an d 
d — maxjgg* dj, i.e., 

oo 

jes* fc=i 

Further, make the following assumption. 

(A3) The distribution of the data CP has a probability density with respect to 
the corresponding Lebesgue measure, bounded from above by a constant 
B > 0. 

Theorem 2.2. Le£ ip and ip A be realizations of the Gibbs estimators defined by 
(2.2)-(2.3), respectively. Let Al, A2 and A3 hold. Set w = 8Cmax(i,C) and 
S = n£/[w + 4(a 2 + C 2 )}, for £ £ (0, 1), and let e £ (0, 1). Then with V -probability 
at least 1 — 2e, 

Rm - Rm J - 

where D is a constant depending only on w, a, C, £, a and B. 

Theorem 2.2 illustrates that we obtain the minimax rate of convergence over 
Sobolev classes up to a log(n) term. Indeed, the minimax rate to estimate a single 
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function with regularity r is n 2r + 1 , see for example Tsybakov [43, Chapter 2]. 
Theorem 2.1 and Theorem 2.2 thus validate our method. 

A natural extension is to consider sparsity on both regressors and their ex- 
pansion, instead of sparse regressors and nested expansion as before. That is, 
we no longer consider the first rrij dictionary functions for the expansion of rc- 
gressor j. To this aim, we slightly extend the previous notation. Let if G N* be 
the length of the dictionary. A model is now denoted bym= (mi, . . . , m p ) and 
for any j G {1, . . . ,p}, nx, — (toj-i, . . . , frtjK) is a if-sized vector whose entries 
are 1 whenever the corresponding dictionary function is present in the model 
and otherwise. Introduce the notation 

5(m) = { mj ^ 0, j G {1, . . . ,p}}, S(mj) = {m jk ^ 0, k G {1, . . . , K}}. 

The prior distribution on the models space M is now 



l -^-^r ( P V 1 TT ( K 



-l 



for any a G (0, 1/2). 

Theorem 2.3. Let -0 and -0 A be realizations of the Gibbs estimators defined 
by (2.2)-(2.3), respectively. Let Al and A2 hold. Set w — 8Cmax(L, C) and 
S = n£/[w + 4(a 2 + C 2 )}, for £ G (0,1), and let e G (0,1). Then with V -probability 
at least 1 — 2e, 

r$)- Rm i r ( } _ W) 

+|g(m)| logW(ml + logM) E 1^)1 + ^ 
where T> depends upon w, a , C , I and a defined above. 



3. MCMC Implementation 

In this section, we describe an implementation of the method outlined in the 
previous section. Our goal is to sample from the Gibbs posterior distribution p$. 
We use a version of the so-called Subspace Carlin and Chib (SCC) developed 
by Petralias [32] , Petralias and Dellaportas [33] which originates in the Shotgun 
Stochastic Search algorithm (see Hans, Dobra and West [20]). The key idea of 
the algorithm lies in a stochastic search heuristic that restricts moves in neigh- 
borhoods of the visited models. Let T G N* and denote by {8(t),m(t)}J =0 
the Markov chain of interest, with 9(t) £ O m (t)- Define i: t i-> {+,—,=}, 
the three possible moves performed by the algorithm: an addition, a dele- 
tion or an adjustment of a regressor. Let {ei, . . . ,e p } be the canonical base 
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of MP. For any model m(t) = (mi(t), . . . ,m p (t)) G M, define its neighborhood 
{V + [m(i)],V-[m(t)],V = [m(i)]}, where 

V+[m(i)] ={k6M:k = m(t)+a;e j ,^N*,j G {l,...,jj}\5[m(t)]}, 

V"[m(t)] = {k G M: k = m(i) -mj(t)ej,j £ S[m{t)}}, 

and 

V = [m(t)] ={keM: 5(k) = 5[m(t)]}. 

A move is chosen with probability By convention, if S'[m(t)] = p 

(respectively S'[m(t)] = 1) the probability of performing an addition move (re- 
spectively a deletion move) is zero. Note £: {+, — } h-> { — , +} and let D m be 
the design matrix in model m G M. Denote by LSE m the least square estimate 
LSE m = (D^Dm)- 1 D' m Y (with Y = (Y u . . . , Y n )) in model m G M. For ease 
of notation, let 3 denote the identity matrix. Finally, denote by 0(-;/i, T) the 
density of a Gaussian distribution N(/i, T) with mean \x and covariance matrix 
F. A description of the full algorithm is presented in Algorithm 1. 
The estimates ^ and ^ A are obtained as 

p K 

3=1 fe=l 

and for some burnin b G {1, . . . , T — 1}, 

p K / T \ 

* A = EEU^ E 

j=l k=l \ l=b+l / 

The transition kernel of the chain defined above is reversible with respect to 
Ps ® r) a , hence this procedure ensures that {6(t)}J =1 is a Markov Chain with 
stationary distribution p$. 

4. Numerical studies 

In this section we validate the effectiveness of our method on simulated data. 
All our numerical studies have been performed with the software R (see R Core 
Team [34]). The code is available upon request to the corresponding author. 

Some comments are in order here about how to calibrate the constants C, er 2 , 
S and a. Clearly, a too small value for C will stuck the algorithm, preventing the 
chain to escape from the initial model. Indeed, most proposed models will be 
discarded since the acceptance ratio will frequently take the value 0. Conversely, 
a large value for C deteriorates the quality of the bound in Theorem 2.1, Theo- 
rem 2.2, Theorem 2.3 and Theorem 5.1. However, this only influences the theo- 
retical bound, as its contribution to the acceptance ratio is limited to log(2C). 
We thereby proceeded with typically large values of C (such as C = 10 6 ). As 
the parameter a 2 is the variance of the proposal distribution <f>, the practioner 
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Algorithm 1 A Subspace Carlin and Chib-based algorithm 



Initialize (0(0), m(0)). 
for t = 1 to T do 

Choose a move i(t) with probability q[i(t)]. 

For any k S V l ^> [m(t — 1)], generate 9^ from the proposal density </>(•; LSE^, u 2 J). 
Propose a model k £ V l '- t '[m(t — 1)] with probability 

7 (m(t-l),k): 



£jeV*C*)[m(t-i)] 4] 
where 

A- = 

" J 0(6» j; LSEj,CT 2 J)' 

6: if t(t) 6 {+, -} then 

7: For any h 6 yi^W) [k], generate h from the proposal density </>(■; LSE h , cr 2 J). Note 

that m(i- 1) e V«W*»[k]. 
8: Accept model k, i.e., set m(t) = k and 6(t) = 6^, with probability 



a = nun II A ^W)hO^ =>(t - 1)) 



^m(t-i)gK(i(t))h(m(t-l),k) 



gftffll Sh6V'W[m(t-l)] A h 
' ?£(«(*))] E h6 V«W))[k] A h 



Otherwise, set m(t) = m(t — 1) and 0(i) = 8 m (t—i)- 
else 

Generate 9 m u_i) from the proposal density </>(■; LSE m( - t _ 1 ) , cr 2 3). 
Accept model k, i.e., set m(t) = k and 6(t) = 0^, with probability 

, A k7 (k,m(t-1)) 
o = mm 1, 



^m(t-i)7(m(i- l),k) 

Otherwise, set m(t) = m(i — 1) and 0(t) = 9 m (t-i)- 
12: end if 
13: end for 



should tune it in accordance with the noise level of the data. The parameter 
requiring the finest calibration is 5: the convergence of the algorithm is sensi- 
tive to its choice. Dalalyan and Tsybakov [16, 17] exhibit the theoretical value 
S = n/Acr 2 . This value leads to very good numerical performances, as it has 
been also noticed by Dalalyan and Tsybakov [16, 17], Alquier and Biau [3[. The 
choice for a is guided by a similar reasoning to the one for C. Its contribution to 
the acceptance ratio is limited to a log(l/a) term. The value a = 0.25 was used 
in the simulations for its apparent good properties. Although it would be com- 
putationally costly, a finer calibration through methods such as cross-validation 
is possible. 

Finally and as a general rule, we strongly encourage practitioners to run 
several chains of inequal lengths and to adjust the number of iterations needed 
by observing if the empirical risk is stabilized. 

Model 1. n = 200 and S* = {1, 2, 3, 4}. This model is similar to Meier, van de 
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Table 1 

Each number is the mean (standard deviation) of the RSS over 10 independant runs. 



MCMC 


p = 50 
3000 it. 


p = 200 
10000 it. 


p = 400 
20000 it. 


Model 1 
Model 2 
Model 3 


0.0318 (0.0047) 
0.0411 (0.0061) 
0.0665 (0.0421) 


0.0320 (0.0029) 
0.1746 (0.0639) 
0.1151 (0.0399) 


0.0335 (0.0056) 
0.2201 (0.0992) 
0.1597 (0.0579) 



Geer and Biihlmann [29, Section 3, Example 1] and is given by 
with 



V>* : x i— > — sin(2;r), : x i— > x 3 , ip^: x ^ x, 

il>l:x^e- x -e/2, &~N(0,0.1), ie{l,...,n}. 

The covariates are sampled from independent uniform distributions over 
(-1,1)- 

Model 2. n = 200 and S* — {1, 2, 3, 4}. As above but correlated. The covariates 
are sampled from a multivariate gaussian distribution with covariance matrix 

e« = 2-i*-'*i- a , i,je{i,..., P }. 

Model 3. n = 200 and S* = {1, 2, 3, 4}. This model is similar to Meier, van de 
Geer and Biihlmann [29, Section 3, Example 3] and is given by 

Yi = 5^(X iX ) + 3V£(X i2 ) + 4^(X l3 ) + 6^(X i4 ) + 

with 

i* ,* M 2 ,\ /* sin(27ra;) 

2 — sm(27rx) 

V»4 : x H> 0.1sin(27ra;) + 0.2 cos(27nc) + 0.3 sin 2 (27ra) + 0.4 cos 3 (27ra;) 

+ 0.5sin 3 (27ra), & ~ N(0, 0.5), i € {1, . . . , n}. 

The covariates are sampled from independent uniform distributions over (—1, 1). 

The results of the simulations are summarized in Table 1 and illustrated by 
Figure 1 and Figure 2. The reconstruction of the true regression function ip* is 
achieved even in very high-dimensional situations, pulling up our method at the 
level of the gold standard Lasso. 



5. Proofs 

To start the chain of proofs leading to Theorem 2.1, Theorem 2.3 and The- 
orem 2.2, we recall and prove some lemmas to establish Theorem 5.1 which 
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consists in a general PAC-Bayesian inequality in the spirit of Catoni [14, Theo- 
rem 5.5.1] for classification or Catoni [14, Lemma 5.8.2] for regression. Note also 
that Dalalyan and Tsybakov [17, Theorem 1] provides a similar inequality in the 
deterministic design case. A salient fact on Theorem 5.1 is that the validity of 
the oracle inequalities only involves the distribution of the noise variable £i , and 
that distribution is independent of the sample size n. 

The proofs of the following two classical results are omitted. Lemma 5.1 is 
a version of Bernstein's inequality which originates in Massart [27, Proposition 
2.19], whereas Lemma 5.2 appears in Catoni [14, Equation 5.2.1]. 

Fig 1: Estimates (red dashed lines) for ip^, ip^i ^3 an d ip£ (solid black lines). 
Other estimates (for ipj, j ^ {1,2,3,4}) are mostly zero. 

(a) Model 1, p = 200. (b) Model 1, p = 400. 




(c) Model 2, p = 50. (d) Model 3, p = 50. 




For x £ M, denote (x) + = max(i, 0). Let /ij, // 2 be two probabilities. The 
Kullback-Leibler divergence of \i\ with respect to fi 2 is denoted XL(ni, fj, 2 ) and 
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is 

3CC(/ii,/i 2 ) 



7log(£)d/x 1 if M l<M2, 
oo otherwise. 



Finally, for any measurable space (A, A) and any probability ir on (A, .A), denote 
by M}]_ W (A,A) the set of probabilities on (A, A) absolutely continuous with 
respect to ir. 

Lemma 5.1. Let (7i)" =1 be independent real-valued variables. Assume that 
there exist two positive constants v and w such that, for any integer k > 2, 

Then for any 7 £ (0, ~), 



E 



expUj^CTj-ETi) 



< 



exp 



9 

i>7 



2(1 - 107) 



Lemma 5.2. Le£ (A, .A) 6e a measurable space. For any probability p on (A, A) 
and any measurable function h : A — > K such that /(expo ft,)d/i < oo, 

log / (expo/i)d/x= sup / hdm — 0CL(m, fi), 

J m£M.\^(A,A)J 

with the convention 00 — 00 = —00. Moreover, as soon as h is upper-bounded 
on the support of \x, the supremum with respect to m on the right-hand side is 
reached for the Gibbs distribution g given by 

dg exp(/i(a)) 
d/i J (exp o /ijd/i 

Theorem 5.1 is valid in the general regression framework. In the proofs of 
Lemma 5.3, Lemma 5.4, Lemma 5.5 and Theorem 5.1, we consider a general 
regression function if>* . Denote by (0,T) a space of functions equipped with a 
countable generated cr-algebra, and let tt be a probability on (8,T), referred to 
as the prior. Lemma 5.3, Lemma 5.4, Lemma 5.5 and Theorem 5.1 follow from 
the work of Catoni [14], Dalalyan and Tsybakov [16, 17], Alquier [2], Alquier 
and Biau [3]. Let S > and consider the so-called posterior Gibbs transition 
density ps with respect to 7r, defined as 

In the following three lemmas, denote by p a so-called posterior probability 
absolutely continuous with respect to it. Let -0 be a realization of a random 
variable ^ sampled from p. 
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Fig 2: plot of the responses Yi, . . . , Y n against their estimates. The more points 
on the first bisectrix (solid black line), the better the estimation. 

(a) Model 1, p = 200. (b) Model 1, p = 400. 




Lemma 5.3. Let Al and A2 hold. Set w = 8Cmax(L, C), 5 £ (0, n/[w + A(a 2 + 
C 2 )]) and £ € (0, 1). Then with V-probability at least 1 — e 

RW - Rty*) < l jgg+g) [Ml') ' RniP) + l0g ^W +1 ° g n . 

n—wS \ / 

Proof. Apply Lemma 5.1 to the variables T{ defined as follow: for any ij) G, 

Ti = —(Yi — ?/>(Xj)) 2 + (Yi — ijj*(Xi)) 2 , i6{l n}. (5.2) 

First, let us note that 

Rty) - R(r) = H(Yi - ^(Xx)) 2 ] - E[(Yt - ^(Xi)) 2 ] 

= EK2Y, - ^(Xi) - ^*(X 1 ))(^(X 1 ) - V^Xi))] 

- E [(V*(X X ) - V>(Xi)) E[(2Wi + ^*(Xi) - V(X0)| X x ]] 

= 2E[(^*(X 1 )-^(X 1 ))E[a|X 1 ]] + E[^*(Xi) - ^(Xi)] 2 . 



AsE^ilXi] =0, 

Rty) - R(i>*) = E[^*(X) - V(X)] 2 . 



(5.3) 
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By (5.2), the random variables (Tj)™ =1 are independent. Using Lemma 5.1, we 
get 



= 5> [(2y< - ^(xo - ^(x i )) 2 (^(x i ) - y/(x 4 )) 2 ] 

i=l i=l 

= 5^EE[(2W i +V*(X i ) -V^(X. i )) 2 (^(X l )-^(X l )) 2 |X l ] . 

i=l 

Next, using that \a + b\ k < 2 fc ~ 1 (|a| + |6|) for any a, b G R and fc e N* , we get 

n n 

£eT? < 2^E [(^(Xi) - V*(X,)) 2 E [(4W? +4C 2 )|X]] 

i=l i=l 

n 

< 8 (a 2 + C 2 ) £ E [(V>(X) - ^(Xi)) 2 ] 
i=l 

= 8ti (a 2 + C 2 ) (R(tp) - d = «, (5.4) 

where we have used (5.3) in the last equation. It follows that for any integer 
k > 3, 

n n 

5>[m)t] = £ EE [(^)tl x *] 

i=l i=l 
n 

< ^EE [\2Yi - 1>(Xi) - ^(XitfMXi) - V*(Xi)| fc | Xi] 

i=l 
n 

= ^EE [|2Wi + V*(X,) - V(X l )| fc |V(X I ) - ^(X,)| fc | Xi] 

i=l 
n 

< 2*" 1 ^ EE [(2 fc |^|* + ^(Xi) - ^(X,)| fc ) |^(X) - ^(X,)| fc | X,] . 
i=i 

Using that |V>(xi) - 7/>*(x l )| fc < (2C) fc ~ 2 |^(x t ) - ^*(x;)| 2 and (5.3), we get 

n n 

£E[(Ti)5.] < 2 fc ~ 1 ^ (2 fc " 1 fc! ( 7 2 L fc - 2 + (2C) fe ) (2C) fe ~ 2 [i?(i/;) - R{^*)} 



i=l 



; | / n2fc-4„2 rfe-2 , 2 2fc-4/^fc 



Recalling that C > max(l, a) gives 

2 2k-i a 2 L k-2 + 2_2 2k ~ A C k 4 fe -V 2 L fe - 2 , |4 fc - 2 C fe 



ct 2 + C 2 2(T 2 C 2 



< i(4L) fe - 2 + i(4C) fc - 2 = [4max( J L,C)] fe - 2 . 
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Hence 



kl 



^E[(T,)+] < -vw k ~ 2 , with w = 8Cmax(L,C). 

i=l 

Applying Lemma 5.1, we obtain, for any real S € (0, — ), with 7 = 

Eexp[S(R n (^) - R n {^) + R(i>) - R(i/>*))] < exp 
that is, that for any real number e £ (0, 1), 



(5.5) 



v6 2 



2n 2 (1 



Eexp 



6[R n (^*) ~ R n {i>)] + S[R{^) - R(<>P*)} 1 



4S(a 2 + C 2 ) 
n — wd 



loe 



< e. (5.6) 



Next, we use a standard PAC-Bayesian approach (as developed in Audibert 
[5], Catoni [14, 15], Alquier [2]). For any prior probability it on (0,T), 



5[R{i>) - RW*)] 1 - 



Eexp 



By the Fubini-Tonelli theorem 



4J(f7 2 + C 2 ) 
n — wS 



+d[R n (rP*)-R n ( 1 p)]-loi 



7r(di/>) < s. 



E / exp 



S[R(iP) - Rty*)] 1 - 



4S(a 2 + C 2 ) 
n — wd 



+5[R n (->p*)-R n (iP)}-lo£ 



n(dip) < e. 



Therefore, for any data-dependent posterior probability measure p absolutely 
continuous with respect to tt, adopting the convention 00 x = 0, 



E / exp 



5[R(1>) ~ R(r)} 1 



AS(a 2 + C 2 ) 
n — w5 



+s[R n (r) - R n m - log^w - log - 

an e 



p(dip) < e. (5.7) 



Recalling that E stands for the expectation computed with respect to P, the 
integration symbol may be omitted and we get 



Eexp 



6[RW) - R(r)} 1 - 



4:6((J 2 + C 2 ) 
n — wS 



+ S[R n (r) - RnW] ~ l0g^(^) - log 1 

d7r e 



< e. 
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Using the elementary inequality enp(Sx) > 1r + (x), we get, with P-probability 
at most e 



1 



A5{a 2 +C 2 ) 
n — wS 



[R(i>) - R{pl>*)\ > Rnty) - RnW) 



log^W + logi 



Taking 6 < n/[w + 4(ct 2 + C 2 )] implies 



1 _^ + f) >0 , 



n — w8 

and with P-probability at least 1 — e, 

1 



R(i/>) - R(i>*) < 



iS(a 2 +C 2 ) 
n—w5 



1 



R n (^)-R n (^) 



log^^ + logi 



□ 



Lemma 5.4. Let Al and A2 hold. Set w = 8Cmax(L, C), 6 G (0, n/[w + 4(a 2 + 
C 2 )]) and £ G (0, 1). Then with V -probability at least 1 — e 



R n (ll>)p{&4>) ~ RnW) < 



1 



AS(a 2 + C 2 ) 
n — wS 

-R(r 



ac£( P ,^) + iogi 



(5.8) 



Proof. Set V G and Z t = (Y, - ^(X,)) 2 - (Yi - ip*(Xi)) 2 , i G {1, . . . , n}. Since 
Zi = —Ti where Tj is defined in (5.2), using the same arguments that lead to 
(5.6), we get that for any 5 G (0, G n/w) and e G (0, 1) 



E / exp 



-6{R(i>) - Rty*)] 1 + 



46{a 2 + C 2 ) 
n — w5 

+S{R n (4>) - Rn(tp*)} - log ^ (rp) - log - 

d7T £ 



p(dtp) < £. 



Using Jensen's inequality, we get 



Eexp 



5[R(i>) - R(r)} 1 



A5(a 2 + C 2 ) 
n — w5 



+S[R n (yj) - R n {r )] - log ^ WO - log -)p(dip) 

U7T e J 

Since exp(fe) > we obtain with P-probability at most e 



< e. 



- / R{ii>) P (dii>) + R(r) 



46(a 2 + C 2 ) 



R n (<P)p(diP) 



, ^ XL(p,n) -flog 1 

RnW) —z — > 0. 
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Taking S < n/[w + 4(ct 2 + C 2 )] yields (5.8). □ 

Lemma 5.5. Let Al and A2 hold. Set w = 8Cmax(L, C), 5 £ (0, n/[w + A{a 2 + 
C 2 )]) and e € (0, 1). TTien P-probability at least 1 — e 



4S(a 2 +C 2 ) 
n—w5 



, ac£( P ,^) + iogi 



Proof. Recall (5.7). By Jensen's inequality, 



E exp 



fl(V)p(<ty) - R(->P*) 



45{a 2 + C 2 
n ~ wS 



+S \R n W)- j R n (^)p{d^ -XC(p,7r)-logi 



< e. 



Using exp(5a;) > 1r + (:e) yields the expected result. 



□ 



Theorem 5.1. Let ip and ip A be realizations of the Gibbs estimators defined 
by (2.2)-(2.3), respectively. Let Al and A2 hold. Set w = 8Cmax(L, C) and 
S = n£/[w + 4(ct 2 + C 2 )], for i G (0, 1), and let e € (0, 1). Then with probability 
at least 1 — 2e, 



R{tp A ) - R{ip*) 



< V inf 



R(Tp)p(dip) 



(5.9) 



where D is a constant depending only upon w, a, C and I. 



Proof. Recall that the randomized Gibbs estimator ^ is sampled from p$. De- 
note by ^ a realization of the variable By Lemma 5.3, with P-probability at 
least 1 — e, 



R(ti>) - Rm < 



i 

IE 

n—w5 



RM)-R n {^) + 



, i 



Note that 



1 d(>5 ( / 1 = 1 exp[-5R n (i))] 
g d^ W S J exp[-<5.R„ WMd^) 

= -SRntf) - log [ expl-SRnWMty)- 
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Thus, with P-probability at least 1 — e, 

R$) - RW) < 4i( 1 ff 2 +c3) (-Rni^) -\ lo %J e X p[-«4(V0M<ty>) 

n—wS 



i, i 

o e 



By Lemma 5.2, with P-probability at least 1 — e, 



R$) - R{V) < J a 2 +C2) inf ( / R n {i>)p{^) - R n m 

n—wo ~> 

XL{ P ,tt) +logi 



5 

Finally, by Lemma 5.4, with P-probability at least 1 — 2e, 

x 4<5( CT 2 +C 2 ) , 

« - w) ^ i peM1 inf (e , T) { y ^ w) - w 

n—w5 ~^ ,Tr ' 

2 JCC(p,7r)+logl | 

n—w8 ) 

Apply Lemma 5.5 with the Gibbs posterior probability defined by (5.1). With 
P-probability at least 1 — e, 

#(V)MdV0 - RW*) < - — J„2+™ ( I Rn{i>)ps{W) - RnW) 



, _ 4<5[> 2 +C 2 ) 
n—w8 



XL(ps,w)+ log \ 
5 



Note that 



nrrl \ f i exp -6i? n (l/>) , , ,, 

XC(p s ,7r) = / log- rr PaW 

7 J exp[-di?„(V')j7r(d-0) 

= -5 / J^WOPaW) - log f / exp[-5i? n (V')]7r(d^) 



By Lemma 5.2, with P-probability at least 1 — e 

R{i>)p 5 {W) - R(r) < 4 sL +c2) jnf ( I Rn{i>)p{&i>) 

1 - +( r J peM 1 , (e,T) u 

n—wo ~ f 

-RnW ) H ; 
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By Lemma 5.4, with P®"-probability at least 1 - 2e 

1 , 48(a 2 +C 2 ) , 

RWpsW) - RW) < iS ZlU , inf / 

1 - ^ +V ) peM 1 , n (e,7) [J 

n—wo ~> 

) + igg+cg j 1 ■ 

n—w5 ) 

As is a convex function, applying Jensen's inequality gives 
Finally, note that 



1 4S(**+C) (l-£)(w + 4(7 2 +AC 2 )' 



%—w8 



□ 



Proof of Theorem 2.1. Let p € M^ 7r (9,T). For any A e T, note that p(A) = 
EmeM Pm(-A) where p m (-) = p(-n8 m ), the trace of p on 6 m . By Theorem 5.1, 
with P-probability at least 1 — 2e 

R{iJj) - RU/i*) < D inf inf 

m£M P eM\ w (e,7) 

R W p m ^)-Rm + mPm ' 7r) + l ° g " ). (5.10) 



Note that for any p £ M.+ ^(O, T) and any m £ M, 

XC(/5 m ,7r) = y log ^^r^ dp m + y log (^j^) d Pm 

/ \p+l' 
1-1 - 9L - 

XC(p m ,7r m )+log(l/a) m j +lo z(\,q{' m \\) +log ' 

jes(r 



|5(m)|7 ° 1-^ 



Next, using the elementary inequality log (?) < k\og(ne/k) and that < 1, 



XL{p m ,TT)<XL{ Pm ,TT m )+\og(l/a) TO 3 - + |5(m)|log 



+ log 



V|5(m)| 
1 - a 



1 - 2a 



We restrict the set of all probabilities absolutely continuous with respect to 7r m 
to uniform probabilities on the ball 23m( x ; C)i with x g 23^(0; C) an d < C < 
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C — \\0\\i. Such a probability is denoted by /x x ,c- With P-probability at least 
1 — 2s, it yields that 



RM>) - RM>*) < T> inf inf inf { I R(tpg)ne c(d0)- 



3CC(^, c ,7T m ) + log X - + \S(m)\ log 



E - 

jes(m) 



Next, note that 



XC(/x 0;C ,7r m ) = log ( ) = lo S E ""J 



Note also that 



y E[yj 



and 



y e [Yx - ^(xo + m*i) - ^-(xo] 2 H6, c (de) 

= J R(i> e )»e,dM) + J E[MXi)-M*i)] 2 H>,dM) 

+ 2 J E{[Yi - ^(XOl^^XO - ^(X!)]}^,^^). 

Sincere 3L(«,C), 

y E^^xo-v^f^^d^ 



= / E 



E Xfojfc ~ Ojk)tPk(Xij) 

j£S(m) fc=l 



and by the Fubini-Tonelli theorem, 

2 y E{[r 1 -V 9 (X 1 )][^(X 1 )-^-(Xi)]} W , c ((W) 



= 2E 



= 0, 
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since J ipg(Ki)[iQ^(d6) — ^(Xx). Consequently, as 
we get 

J R(ip s )fx ex (de)<R(ip 8 ) + C 2 . 
So with P-probability at least 1 — 2e, 



R(ip) - R(ip*) < D inf inf inf I R{ib e ) + ( 2 - R(ib*) 



log(C/C) E m ^ +h ^ + ^ m ^(\dn)-\) + £ 



3 



The function t > i 2 +log(C/t) X)jGS(m) m j/ n is convex. Its minimum is unique 
and is reached for £ = EjeS(m) ^j/(2n)] 1 ^ 2 . With P-probability at least 1 — 2s, 



R{ip) - R(if)*) < D inf inf { ROtpg) - RN>*) 
meMegs^(o,c) 



+ |5 (m) | l0s( ^ (m)l ^^ £ + 

U jes(m) 

where 2) is a constant depending only on w, <r, C, £ and a. As the same inequality 
holds for ip A , this concludes the proof. □ 

Proof of Theorem 2.2. Recall Theorem 2.1. A3 gives 

R{i> e ) - R(ip*) = J OM*) " f(x)) 2 d?(x) < B J (<Mx) ~ ^(x)) 2 dx. 
For any m € M, define 

p rxij 
j=l fc=l 

Recalling (5.3) and A3, for a m £ M we may now write that 



inf Rfa) - Rty*) < R(r m ) - R{V) < B I (V>*(x) - ^(x)) 2 dx 



// V oo \ 2 

E E W«) dx - 
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As {<v9fc}feLi forms an orthonormal basis, 



D 



r ( P oo \ 2 poo 

EE dx = B E E (^-) 2 



As a consequence, with P-probability at least 1 — 2e, 



m [ j'es* 



2r, 



log(n)} 



*,log(p/|S*|) , log(l/e) 



+\S , 

n n 

where CD is the same constant as in Theorem 2.1. For any r > 2, the function 



t h-> <i,fl+i) 2r j -f lo s(") ^ j g convex anc j admits a minimum in ( ^° s (") ) 2r3+1 _i. 

J V / 7i V 2rjdjn 1 

Accordingly, choosing rrij ~ ^ 2° S d"i ) — yields that with P-probability 

at least 1 — 2e. 



jes* 



log(n) 
2nr\- 



\S 



log(l/e) 



where D is a constant depending only on a, u>, cr, C, £ and -B, and that ends 
the proof. □ 

Proof of Theorem 2.3. The proof is similar to the proof of Theorem 2.1. From 
(5.10) and for any p e Mi OT (6,T) and any m e M, 



XC (p m , tt) = 3C£ (p m , 7r m ) + log(l/a) 1 5(m) | + log 



1 - 



log 



V 1 — a 



1NP+ 1, 



1 - a 



l-a K + 1 
l-a 



|5(m)| 



E ^ 



A' 



|S( mj )| 



Using the elementary inequality log < k log(ne/fc) and that a 1 £ (0, 1) 

since a < 1/2, 



XL(p m ,n) < XL(p m ,ir m ) + \S(m) 



log(l/a) +log 



pe 



|S(m)| 



E 1^)1 

jes(m) 



/ Ae 



log 



1-q 
1 - 2a, 



Guedj and Alquier/PAC-Bayesian Sparse Additive Prediction 



24 



Thus with P-probability at least 1 — 2s, 



RN>) - RU*) < D inf inf inf I RNjg) + ( 2 - RM;*) 



pog(C70+log(J0] E 15(1^)1+ log J + |S(m)| log 



ies(m) 

Hence with P-probability at least 1 — 2s, 



|5(m)| 



, < D inf inf <^ i?(V-e) - R(ip*) 
R(iP A ) - Rty*) J " meM eesL(o,c) ' 



|g(m)| log( P /i^(m)|) + logW E |5(mj)| + log(l/ £ ) 



+ 

' n n * — ' ■ - — ■ n 

jeS(m) 

where D is a numerical constant depending upon w, a, C, I and a. □ 
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